K-means Cluster Analysis and Mahalanobis Metrics: a Problematic Match or an Overlooked Opportunity?

Authors

  • Andrea Cerioli
  • A. Cerioli
Abstract

In this paper we consider the performance of the widely adopted K-means clustering algorithm when the classification variables are correlated. We measure performance in terms of recovery of the true data structure. As expected, performance worsens considerably if the groups have elliptical instead of spherical shape. We suggest some modifications to the standard K-means algorithm which considerably improve cluster recovery. Our approach is based on a combination of careful seed selection techniques and use of Mahalanobis instead of Euclidean distances. We show that our method performs well in a number of examples where the standard algorithm fails. In such applications our nonparametric technique is seen to be competitive when compared to parametric model-based clustering methods. Hence, our conclusion is that use of the Mahalanobis distance should become a standard option of the available K-means routines for non-hierarchical cluster analysis. This goal can be achieved by minor modifications in popular commercial software.
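The idea of replacing Euclidean with Mahalanobis distances inside K-means can be illustrated with a minimal sketch. This is not the author's algorithm: the abstract does not specify the seed selection technique or the covariance estimator, so the function name `mahalanobis_kmeans`, its parameters (`init`, `ridge`, `n_iter`), the random seeding, and the unit-determinant normalization of each cluster covariance are all assumptions made for illustration. It assumes NumPy and at least two variables.

```python
import numpy as np


def mahalanobis_kmeans(X, k, init=None, n_iter=20, seed=0, ridge=1e-6):
    """Illustrative K-means variant with per-cluster Mahalanobis distances.

    Hypothetical sketch, not the method of the paper. `init` optionally
    supplies starting centers; otherwise k data points are drawn at random.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    rng = np.random.default_rng(seed)
    if init is None:
        # Plain random seeding (assumption; the paper uses a more careful scheme)
        centers = X[rng.choice(n, size=k, replace=False)].copy()
    else:
        centers = np.array(init, dtype=float)
    # Identity matrices make the first pass an ordinary Euclidean K-means step
    inv_covs = np.stack([np.eye(p)] * k)
    labels = np.full(n, -1)
    for _ in range(n_iter):
        # Squared Mahalanobis distance from every point to every center
        diffs = X[:, None, :] - centers[None, :, :]  # shape (n, k, p)
        d2 = np.einsum("nkp,kpq,nkq->nk", diffs, inv_covs, diffs)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            members = X[labels == j]
            if len(members) <= p:
                continue  # too few points for a covariance; keep old values
            centers[j] = members.mean(axis=0)
            cov = np.cov(members, rowvar=False) + ridge * np.eye(p)
            # Scale to unit determinant so no cluster gains by inflating its volume
            cov /= np.linalg.det(cov) ** (1.0 / p)
            inv_covs[j] = np.linalg.inv(cov)
    return labels, centers
```

A call such as `labels, centers = mahalanobis_kmeans(X, k=3)` returns one cluster label per row of `X` and the final centers. Estimating a separate covariance per cluster is one design choice among several; a single pooled within-cluster covariance would be a simpler alternative.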

Related resources

Persistent K-Means: Stable Data Clustering Algorithm Based on K-Means Algorithm

Identifying clusters, or clustering, is an important aspect of data analysis. It is the task of grouping a set of objects in such a way that objects in the same group/cluster are more similar, in some sense, to one another. It is a main task of exploratory data mining and a common technique for statistical data analysis. This paper proposed an improved version of the K-Means algorithm, namely Persistent K...

Multiple ellipse fitting by center-based clustering

This paper deals with the multiple ellipse fitting problem based on a given set of data points in a plane. The presumption is that all data points are derived from k ellipses that should be fitted. The problem is solved by means of center-based clustering, where cluster centers are ellipses. If the Mahalanobis distance-like function is introduced in each cluster, then the cluster center is repr...

Investigating Distance Metrics in Semi-supervised Fuzzy c-Means for Breast Cancer Classification

In previous work, semi-supervised Fuzzy c-means (ssFCM) was used as an automatic classification technique to classify the Nottingham Tenovus Breast Cancer (NTBC) dataset as no method to do this currently exists. However, the results were poor when compared with semi-manual classification. It is known that the NTBC data is highly non-normal and it was suspected that this affected the poor result...

A Clustering Based Location-allocation Problem Considering Transportation Costs and Statistical Properties (RESEARCH NOTE)

Cluster analysis is a useful technique in multivariate statistical analysis. Different types of hierarchical cluster analysis and K-means have been used for data analysis in previous studies. However, the K-means algorithm can be improved using metaheuristic algorithms. In this study, we propose a simulated annealing based algorithm for K-means in the clustering analysis which we refer it a...

Modification of the Fast Global K-means Using a Fuzzy Relation with Application in Microarray Data Analysis

Recognizing genes with distinctive expression levels can help in the prevention, diagnosis and treatment of diseases at the genomic level. In this paper, fast Global k-means (fast GKM) is developed for clustering gene expression datasets. Fast GKM is a significant improvement of the k-means clustering method. It is an incremental clustering method which starts with one cluster. Iteratively ...

Publication date: 2010